Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace validation test #204

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

joshuazh-x
Copy link
Contributor

Following up on previous discussion with @serathius and @ahrtr, I make some improvements to the Raft trace validation test:

  • Refactored Test Case: Combine the random trace generation and trace validation script into a single test case for better readability and maintainability.
  • Simplified Execution: The test can now be run with a simple make trace-validation command.
  • Enhanced Customization: The test supports customization of the working directory, number of parallel jobs, and the amount of traces to be generated (see README file).
  • Other improvements: An additional bootstrap trace has been included to simplify the TLA+ specification.

These changes aim to streamline the trace validation process and make it more efficient to automate testing.

Please let me know if you have any questions or feedback.

@serathius
Copy link
Member

serathius commented Jun 19, 2024

Please fix the CI failures to ensure timely review.

Also it would be best to CC everyone that was involved with the previous PR

@joshuazh-x joshuazh-x marked this pull request as ready for review July 3, 2024 07:59
@joshuazh-x
Copy link
Contributor Author

cc @serathius @ahrtr , here is the trace validation test with some additional improvements in TLA+ spec. The idea here is to set up a sample application based on raft, running with various faults injected. The produced traces will then be collected and validated by TLA+ trace validation. Just run make trace-validation in raft root folder.

1. Improve etcdraft spec to provide hooks for model-checking and trace validation.
2. Add new sample application based on raft and capable of injecting various faults.
3. Add new test to generate random traces (with faults) and run trace validation on them.
4. Fix some issues in etcdraft spec.

Signed-off-by: Joshua Zhang <[email protected]>
@serathius
Copy link
Member

serathius commented Jul 4, 2024

CCing everyone involved in #113
/cc @xiang90 @pav-kv @ahrtr @zeu5 @lemmy @siyuanfoundation @MadhavJivrajani @andreo @felipecrv

Your input on this PR would be a huge help! The successful integration of TLA+ has the potential to significantly improve the reliability of Raft and etcd. I'd be grateful for any insights you can share.

@k8s-ci-robot
Copy link

@serathius: GitHub didn't allow me to request PR reviews from the following users: andreo, felipecrv, xiang90, zeu5, lemmy.

Note that only etcd-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

CCing everyone involved in #113
/cc @xiang90 @pav-kv @ahrtr @zeu5 @lemmy @siyuanfoundation @MadhavJivrajani @andreo @felipecrv

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zeu5
Copy link

zeu5 commented Jul 6, 2024

A few changes to the TLA spec are unclear for me. I will summarise my understanding and list the concerns below.

Summary

Two new variables are added applied and durableState.applied. The two variables represent the state in memory and that persisted to the log. durableState.applied is updated on PersistState action and applied is updated on Advance. The rest of the changes in TLA revolve around ensuring sound updates to these two variables.

In the code, a new trace event Advance is introduced to ensure updates to applied variable in TLA.

Clarifications

  1. Why do we need both the variables? In the code, the only applied state tracked is in the log.
  2. Whenever ready is requested, the log.applied is updated. This is inconsistent with the TLA model which requires that applied is updated only after the corresponding Ready is advanced. Is there a violation found corresponding to this inconsistency?
  3. I do not see an event for PersistState in the trace logger. So the corresponding action is never executed?

Minor comments

This is not related to the changes in this PR but unclear to me - when a node is stopped (as a failure introduced makeNodesDown) it doesn't update any variables and is stopped. In the TLA model however, this is never tracked and updates are received to the node. So the state in the code and in the model will be out of sync? Shouldn't the model track active nodes?

@joshuazh-x
Copy link
Contributor Author

Thanks @zeu5 for the summary.

Please allow me to clarify some of my thoughts when I made this change. Please correct me if I made any mistake here.

Why do we need both the variables? In the code, the only applied state tracked is in the log.
Whenever ready is requested, the log.applied is updated. This is inconsistent with the TLA model which requires that applied is updated only after the corresponding Ready is advanced. Is there a violation found corresponding to this inconsistency?

To my understanding, log.applied is updated upon receiving MsgStorageApplyResp, indicating successful applying to the state machine. log.applying ( I assume you meant this) is updated upon ready and serves as an optimization to allow issuing new ready without waiting for complete processing of the previous one (relevant with async storage writes). Since I prioritize sync storage write in this spec so log.applied is used here for determining entries in ready to simplify the spec, Adjustments will be made when the spec covers async storage writes.

I do not see an event for PersistState in the trace logger. So the corresponding action is never executed?

PersistState isn't traced as it's external to the Raft module, but it executes as part of Next. Initially, I aimed for non-determinism (allowing PersistState and SendMessags anywhere in the ready phase), but for TLC efficiency, I eventually choose a constrained behavior enforces sequential execution of PersistState and SendMessages after Ready. So whenever a traced Ready is processed, the behavior Ready->PersistState->Ready will be enforced.

Shouldn't the model track active nodes?

The TLA+ spec focuses on state transitions. A non-functional node implies no state changes or message exchange, effectively modeled by behaviors excluding actions related to that node. The state space exploration will include behaviors representing such node downtime scenarios.

Copy link

@zeu5 zeu5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clarification @joshuazh-x I have requested some additional minor changes.

For further updates, I suppose it would also be good to include as part of the merge workflow some model checking on the updated TLA models to ensure the correctness of the updates. @serathius please consider this.

tla/validate-model.sh Outdated Show resolved Hide resolved
tla/etcdraft.tla Show resolved Hide resolved
tla/etcdraft.tla Show resolved Hide resolved
tla/etcdraft.tla Outdated Show resolved Hide resolved
tla/etcdraft.tla Outdated Show resolved Hide resolved
tla/etcdraft2.tla Outdated Show resolved Hide resolved
1. Remove type annotations which are not needed here.
2. Add more comments for the actions.
3. Fix simulation argument in valide-model.sh.
4. Let validate-model.sh use all cores to improve performance.

Signed-off-by: Joshua Zhang <[email protected]>
1. Decouple SpecAction and relevant definitions from the core etcd raft spec.
Put them into a middle layer spec etcdraft_control.tla
2. Fix racing issue in trace_validation_test.go.

Signed-off-by: Joshua Zhang <[email protected]>
@joshuazh-x
Copy link
Contributor Author

I'll keep each fix in a separated commit during the review. If everything goes well in the end, I'll combine them into one before final merge.

@joshuazh-x
Copy link
Contributor Author

/cc @serathius @ahrtr @zeu5 @lemmy any feedbacks for this pr?

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshuazh-x, lemmy
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr
Copy link
Member

ahrtr commented Aug 13, 2024

Please try not to submit huge PR. Please try to breakdown this PR as much as you can,

  • I see you removed the huge file tla/example.ndjson, pls do it in a separate PR and ensure the readme.md is also updated accordingly.
  • get the TLA+ changes in a separate PR, and leave it to TLA+ experts to review it.
  • get the test included in a separate PR. I think that it makes more sense to integrate the trace test with etcd's robustness test. We can discuss that after above two items are resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants